Abstract

The problem of hierarchically clustering items from pairwise similarities arises across various scientific
disciplines, from biology to networking. Often, applications of clustering techniques are limited by the
cost of obtaining similarities between pairs of items. While prior work has reconstructed clusterings
from a significantly reduced set of pairwise similarities via adaptive measurements, these
techniques are only applicable when the user can choose which similarities to observe. In this paper, we
examine reconstructing the hierarchical clustering from similarities observed at random. We derive precise
bounds which show that a significant fraction of the hierarchical clustering can be recovered using fewer
than all the pairwise similarities. We find that the correct hierarchical clustering down to a constant
fraction of the total number of items (i.e., clusters of size $O(N)$) can be found using only $O(N \log N)$
randomly selected pairwise similarities in expectation.

1 Introduction 

Hierarchical clustering based on pairwise similarities arises routinely in a wide variety of engineering and
scientific problems. These problems include inferring gene behavior from microarray data [1], Internet
topology discovery [2], detecting community structure in social networks [3], advertising [4], and database
management [5, 6]. Often there is a significant cost associated with obtaining each similarity value. This cost
can range from the computation time needed to calculate each pairwise similarity (e.g., phylogenetic tree
reconstruction using amino acid sequences [7]) to the measurement load on the system under consideration
(e.g., Internet topology discovery using tomographic probes [5]). In addition, situations where the
similarities require an expert human to perform the comparisons result in a significant cost in terms of the
time and patience of the user (e.g., human perception experiments in [4]).

Prior work has attempted to develop efficient hierarchical clustering methods [5], [3], but many proposed
techniques are heuristic in nature and do not provide any theoretical guarantees. Provable bounds are
available in [10], which robustly finds clusters down to size $O(\log N)$. The main limitation of this approach
is that it is only applicable to problems where the user has control over the specific pairs of items to query
and where the pairwise similarities are acquired in an online fashion (i.e., one at a time, using past
information to inform future samples). In many situations, either this control is not available to the user, or
a subset of similarities is acquired in a batch setting (e.g., recommender systems problems [11]). This motivates
resolving the hierarchical clustering given a number of pairwise similarities observed at random,
where adaptive control is not available. Specifically, we look to answer the following question: How many
similarities observed at random are required to reconstruct the hierarchical clustering? The work in [12]
developed a novel clustering technique to approximately resolve clusters down to size $O(N)$ from random
observations. In contrast, we look to reconstruct the clustering hierarchy (to some pruning) exactly with high
probability, and we also examine discovering clusters of size significantly less than $O(N)$.

While results in [10] indicate that resolving the entire hierarchical clustering under a sampling at-random
regime requires effectively all the pairwise similarities, we find that a significant fraction of the clustering
hierarchy can be resolved accurately. Specifically, we resolve the similarity sampling rate required given a
desired level of clustering resolution. The only restriction we place on the observed pairwise similarities
is that they satisfy the Tight Clustering (TC) condition, which states that intracluster similarity values are
greater than intercluster similarity values. This sufficient condition is required for any minimum-linkage
clustering procedure, and can commonly be found underlying branching processes where the similarity between
items is a monotonically increasing function of the distance from the branching root to their nearest common
branch point (such as clustering resources in the Internet [13]), or when similarity is defined by density-based
distance metrics [14].

Our results show that the hierarchy down to clusters sized a constant fraction of the total number
of items (i.e., $O(N)$) can be resolved with only $O(N \log N)$ pairwise similarities observed at random on
average. To find smaller clusters of size $O(N^{\beta})$, where $\beta \in (0, 1)$, we derive bounds which show that only
$O(N^{2-\beta} \log N)$ similarities are required in expectation.

The paper is organized as follows. The hierarchical clustering problem and problems associated with
randomly selected pairwise similarities are introduced in Section 2. Our derived bounds are presented in
Section 3. Finally, concluding remarks are made in Section 4.

2 Hierarchical Clustering and Notation 

Let $X = \{x_1, x_2, \ldots, x_N\}$ be a collection of $N$ items which has an underlying hierarchical clustering
denoted as $\mathcal{T}$.

Definition 1. A cluster $C$ is defined as any subset of $X$. A collection of clusters $\mathcal{T}$ is called a hierarchical
clustering if $\bigcup_{C_i \in \mathcal{T}} C_i = X$ and for any $C_i, C_j \in \mathcal{T}$, only one of the following is true: (i) $C_i \subset C_j$,
(ii) $C_j \subset C_i$, (iii) $C_i \cap C_j = \emptyset$.

Without loss of generality, we will consider $\mathcal{T}$ as a complete binary tree with $N$ leaf nodes, where for
every $C_k \in \mathcal{T}$ that is not a leaf of the tree, there exist proper subsets $C_i$ and $C_j$ of $C_k$ such that
$C_i \cap C_j = \emptyset$ and $C_i \cup C_j = C_k$.

Our measurements will be from $S = \{s_{i,j}\}$, the collection of all pairwise similarities between the items
in $X$, with $s_{i,j}$ denoting the similarity between $x_i$ and $x_j$ and assuming $s_{i,j} = s_{j,i}$. The similarities must
conform to the hierarchy of $\mathcal{T}$ through the following sufficient condition.

Definition 2. The triple $(X, \mathcal{T}, S)$ satisfies the Tight Clustering (TC) Condition if for every set of
three items $\{x_i, x_j, x_k\}$ such that $x_i, x_j \in C$ and $x_k \notin C$, for some $C \in \mathcal{T}$, the pairwise similarities satisfy
$s_{i,j} > \max(s_{i,k}, s_{j,k})$.

In words, the TC condition implies that the similarity between all pairs within a cluster is greater
than the similarity with respect to any item outside the cluster. Under the TC condition, the tree found
by agglomerative clustering will match the true clustering hierarchy, $\mathcal{T}$. Minimum-linkage agglomerative
clustering [15] is a recursive process that begins with singleton clusters (i.e., the $N$ individual items to
be clustered). At each step of the algorithm, the pair of most similar clusters, associated with the largest
observed pairwise similarity, is merged. The process is repeated until all items are merged into a single
cluster. The main drawback of this technique is that it requires knowledge of all $O(N^2)$ pairwise similarity
values (i.e., all values must be known to find the maximum). This methodology will therefore be infeasible for
problems where $N$ is large, or where there is a significant cost to obtaining each similarity.
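To make the TC condition concrete, the following is a minimal sketch that checks it on a toy example. It assumes the hierarchy is given as a list of Python sets of item indices and the similarities as a full symmetric matrix; the function name `satisfies_tc` and the toy data are illustrative and not part of the paper. The check uses the equivalent form that the smallest similarity inside each cluster must exceed the largest similarity between that cluster and any outside item.

```python
from itertools import combinations

def satisfies_tc(sim, clusters):
    """Return True if every cluster's smallest internal similarity exceeds
    its largest similarity to any item outside the cluster."""
    n = len(sim)
    for cluster in clusters:
        inside = sorted(cluster)
        outside = [k for k in range(n) if k not in cluster]
        if len(inside) < 2 or not outside:
            continue  # the condition is vacuous for singletons and the root
        min_within = min(sim[i][j] for i, j in combinations(inside, 2))
        max_across = max(sim[i][k] for i in inside for k in outside)
        if min_within <= max_across:
            return False
    return True

# Toy hierarchy on four items: {0,1} and {2,3} merge into the root.
tree = [{0}, {1}, {2}, {3}, {0, 1}, {2, 3}, {0, 1, 2, 3}]
good = [[0, 9, 1, 2],
        [9, 0, 3, 1],
        [1, 3, 0, 8],
        [2, 1, 8, 0]]
bad = [row[:] for row in good]
bad[0][2] = bad[2][0] = 10  # an inter-cluster similarity now exceeds s_{0,1}
```

Here `satisfies_tc(good, tree)` holds while `satisfies_tc(bad, tree)` fails, because item 2 (outside the cluster {0, 1}) has become more similar to item 0 than item 1 is.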

To reduce the measurement cost, we consider an incomplete observation of pairwise similarities. For our
specific model, we define the indicator matrix of similarity observations, $\Omega$, such that $\Omega_{i,j} = 1$ if the pairwise
similarity $s_{i,j}$ has been observed and $\Omega_{i,j} = 0$ if the pairwise similarity $s_{i,j}$ is not observed (i.e., unknown).
The pairwise similarities are observed uniformly at random, defining the similarity observation matrix as,

$$ P(\Omega_{i,j} = 1) = p \quad \forall\, i, j \tag{1} $$

for some probability $p > 0$.
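As an illustration of this observation model, the sketch below draws a symmetric indicator matrix in which each unordered pair is observed independently with probability p. The function name `sample_observations` and the `seed` parameter are assumptions made for the example; the paper does not prescribe an implementation.

```python
import random

def sample_observations(n_items, p, seed=None):
    """Symmetric 0/1 indicator matrix: each unordered pair is observed
    independently with probability p (the diagonal is left at 0)."""
    rng = random.Random(seed)
    omega = [[0] * n_items for _ in range(n_items)]
    for i in range(n_items):
        for j in range(i + 1, n_items):
            if rng.random() < p:
                omega[i][j] = omega[j][i] = 1
    return omega

omega = sample_observations(200, 0.3, seed=0)
# Number of observed pairs, to compare with the mean p * N(N-1)/2.
observed = sum(omega[i][j] for i in range(200) for j in range(i + 1, 200))
expected = 0.3 * 200 * 199 / 2
```

The count `observed` fluctuates around `expected`, which is why the results in this paper are stated in terms of the expected number of similarities.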

To reconstruct the clustering from incomplete measurements, this paper focuses on a slightly modified
version of minimum-linkage agglomerative clustering. This process can be considered off-the-shelf
agglomerative clustering where the pairwise similarities not observed are simply ignored (i.e., unobserved
similarities are zero-filled). The methodology is described in Algorithm 1.
Algorithm 1 - incompleteAgglomerative($X$, $S$, $\Omega$)
Given :

1. Set of items, $X = \{x_1, x_2, \ldots, x_N\}$.

2. Matrix of pairwise similarities, $S$.

3. Indicator matrix of observed pairwise similarities, $\Omega$.
Clustering Process :

Initialize clustering tree $\mathcal{T} = \{C_1, C_2, \ldots, C_N\}$, where $C_i = x_i$ for all $i = \{1, 2, \ldots, N\}$.
Zero-fill unobserved pairwise similarities, $s_{i,j} = 0$ if $\Omega_{i,j} = 0$.
While $\max_i |C_i| < N$

1. Find $i, j = \arg\max_{i \neq j} (s_{i,j})$

2. Merge the items $i, j$ in the hierarchical clustering $\mathcal{T}$, and matrix of pairwise similarities $S$.

(a) Set $s_{i,k} = \max(s_{i,k}, s_{j,k})$ for all $k = \{1, 2, \ldots, N\}$.

(b) Set $s_{j,k} = 0$ for all $k = \{1, 2, \ldots, N\}$.

(c) Set $C_i = \{C_i, C_j\}$

(d) Set $C_j = \{\}$

Output :

Return the estimated hierarchical clustering, $\mathcal{T}$.
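The steps of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative, unoptimized rendering rather than the author's implementation: it assumes strictly positive similarities (so that a zero entry always means "unobserved"), represents clusters as frozensets, and returns the merged clusters in the order they are formed. The helper name `mask` and the toy data are hypothetical.

```python
def incomplete_agglomerative(sim, omega):
    """Merge clusters greedily by the largest observed similarity.

    sim and omega are symmetric N x N lists; unobserved entries are
    zero-filled, so similarities are assumed to be strictly positive.
    Returns the merged clusters in the order they were formed."""
    n = len(sim)
    # Zero-fill unobserved similarities (work on a copy of S).
    s = [[sim[i][j] if omega[i][j] and i != j else 0.0 for j in range(n)]
         for i in range(n)]
    clusters = {i: frozenset([i]) for i in range(n)}  # active clusters
    merges = []
    while len(clusters) > 1:
        active = sorted(clusters)
        # Step 1: most similar pair of active clusters (ties: first found).
        i, j = max(((a, b) for x, a in enumerate(active) for b in active[x + 1:]),
                   key=lambda pair: s[pair[0]][pair[1]])
        # Step 2: merge j into i, keeping the larger similarity to every k.
        for k in range(n):
            s[i][k] = s[k][i] = max(s[i][k], s[j][k])
            s[j][k] = s[k][j] = 0.0
        s[i][i] = 0.0
        clusters[i] = clusters[i] | clusters.pop(j)
        merges.append(clusters[i])
    return merges

def mask(n, pairs):
    """Hypothetical helper: indicator matrix with the given pairs observed."""
    omega = [[0] * n for _ in range(n)]
    for i, j in pairs:
        omega[i][j] = omega[j][i] = 1
    return omega

good = [[0, 9, 1, 2], [9, 0, 3, 1], [1, 3, 0, 8], [2, 1, 8, 0]]
full = mask(4, [(i, j) for i in range(4) for j in range(i + 1, 4)])
partial = mask(4, [(0, 1), (2, 3), (1, 2)])  # only 3 of the 6 pairs
merges_full = incomplete_agglomerative(good, full)
merges_partial = incomplete_agglomerative(good, partial)
```

In this toy case the three observed pairs are enough: both calls produce the merges {0, 1}, then {2, 3}, then the root, which previews the connectivity argument of the next subsection.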



2.1 Canonical Subproblem 

The intuition for why an incomplete subset of pairwise similarities is useful can be seen when considering the
canonical subproblem where a single cluster $C$ is split into two subclusters $C_L, C_R$ (where $C = C_L \cup C_R$ and
$C_L \cap C_R = \emptyset$). In order to properly resolve the two clusters, it is necessary for the agglomerative clustering
algorithm to have enough pairwise similarities to make an informed decision as to which items belong to
which subcluster. We define a sampling graph, $\mathcal{G}$, resolved from the pairwise similarity observation matrix
$\Omega$.

Definition 3. Consider the $N \times N$ pairwise similarity observation matrix $\Omega$. Then the sampling graph
$\mathcal{G} = \{V, E\}$ is defined as the graph with $|V| = N$ nodes where the edge $e_{i,j} = 1$ if $\Omega_{i,j} = 1$ (i.e., the pairwise
similarity was observed) and $e_{i,j} = 0$ otherwise (i.e., the pairwise similarity was not observed).

In the context of the sampling graph, $\mathcal{G}$, we can state the following proposition with respect to resolving
the canonical subproblem.

Proposition 1. Consider a cluster $C$ consisting of two subclusters, $C_L$ and $C_R$ (such that $C_L \cup C_R = C$ and
$C_L \cap C_R = \emptyset$). Then, the agglomerative clustering algorithm will resolve the two subclusters if and only if the
sampling subgraphs associated with each subcluster (i.e., $\mathcal{G}_L$ for cluster $C_L$ and $\mathcal{G}_R$ for cluster $C_R$) are both
connected.

The intuition behind this proposition is as follows. Any clustering procedure requires enough information
to associate each item with the cluster it belongs to. For example, minimum-linkage agglomerative clustering
would require that each item observe at least a single similarity with another item in that cluster. But this
is not enough, as we also require that one of these two items have an observed similarity with another item
in the remainder of the cluster (i.e., one of the items in the cluster, not including the two items paired
together), and so on until all the items can be clustered. In terms of the sampling graph, this is the
requirement that a path can be found between items in the same cluster (as all of these pairwise similarities
will be greater than the similarity to any item outside of this cluster, as stated by the TC condition). It
follows that a cluster of items will only be returned if the sampling graph is connected between those items.
An example of this clustering can be found in Figure 1.




Figure 1: (A) - Dense sampling graph on two clusters $C_L, C_R$, where an edge between nodes indicates that
we have observed the pairwise similarity between those items, (B) - Example connected path found through
the sampling graph using agglomerative clustering, (C) - Resulting correct hierarchical clustering.

Alternatively, if the sampling graph on a cluster of items is disconnected into two connected components $C_a, C_b$
(where $C_a \cup C_b = C$ and $C_a \cap C_b = \emptyset$), then the clustering procedure will not have enough information to
merge the items into a single cluster. An example of this incorrect clustering can be found in Figure 2.

Figure 2: (A) - Sparse sampling graph on two clusters $C_L, C_R$, where an edge between nodes indicates that we
have observed the pairwise similarity between those items, (B) - Example disconnected path found through
the sampling graph using agglomerative clustering, (C) - Resulting incorrect hierarchical clustering.
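The condition of Proposition 1 can be checked directly on an indicator matrix. The sketch below is a plain depth-first search over the sampling subgraph induced by one subcluster; the function names and the six-item example are illustrative assumptions, not taken from the paper.

```python
def is_connected(omega, items):
    """True if the sampling graph restricted to `items` is connected."""
    items = list(items)
    if not items:
        return True
    member = set(items)
    seen = {items[0]}
    stack = [items[0]]
    while stack:
        u = stack.pop()
        for v in member:
            if v not in seen and omega[u][v]:
                seen.add(v)
                stack.append(v)
    return seen == member

def subproblem_resolvable(omega, left, right):
    """Proposition 1: both subclusters need connected sampling subgraphs."""
    return is_connected(omega, left) and is_connected(omega, right)

def edges_to_omega(n, edges):
    """Hypothetical helper: symmetric indicator matrix from an edge list."""
    omega = [[0] * n for _ in range(n)]
    for i, j in edges:
        omega[i][j] = omega[j][i] = 1
    return omega

left, right = {0, 1, 2}, {3, 4, 5}
dense = edges_to_omega(6, [(0, 1), (1, 2), (3, 4), (4, 5), (2, 3)])
sparse = edges_to_omega(6, [(0, 1), (1, 2), (3, 4), (2, 3)])  # item 5 has no link inside C_R
```

With `dense` both subgraphs are connected, while with `sparse` item 5 has no observed similarity inside its own subcluster, so the subproblem cannot be resolved.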



3 Main Results 

When pairwise similarities are observed uniformly at random, such that each pairwise similarity is observed
with probability $p$, the resulting sampling graph can be considered a Bernoulli random graph (where each
edge exists with probability $p$). Using Proposition 1 and prior work on random graph theory [16], we can
state the following theorem.

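Theorem 3.1. Consider the quadruple $(X, \mathcal{T}, S, \Omega(p))$, where the Tight Clustering (TC) condition is
satisfied, and $\mathcal{T}$ is a complete (possibly unbalanced) binary tree that is unknown. Then, the agglomerative
clustering algorithm recovers all clusters of size $\geq n \geq 4$ of $\mathcal{T}$ with probability $\geq (1 - \alpha)$ for $\alpha > 0$ given that the
sampling $\Omega(p)$ satisfies,

$$ p \geq \max\left\{ 1 - \left(\frac{\alpha n}{78 N}\right)^{2/n},\; 1 - \left(2^{\frac{1}{n-1}} - 1\right)^{2/n},\; 1 - \left(\frac{\alpha}{2(N-n)}\right)^{1/n} \right\} \tag{2} $$

Proof. The proof of this theorem follows from the results of Propositions 2 and 3. □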

The first component of Theorem 3.1 requires that the sampling probability is large enough that a path
in the sampling graph can be found with high probability for any collection of items of size $\geq n$.
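As a numerical illustration, the sketch below evaluates the sampling probability required by Theorem 3.1. It assumes the bound is the maximum of the three terms obtained in Section 5.2 (the two connectivity terms, with the constant 78 from the choice of C made there) and in the proof of Proposition 3 (the union-bound term); the function name is hypothetical.

```python
def required_sampling_probability(n_items, min_size, alpha):
    """Largest of the three lower bounds on p (requires 4 <= min_size < n_items)."""
    N, n = n_items, min_size
    # Each cluster of n items has a connected sampling subgraph (Section 5.2).
    term1 = 1 - (alpha * n / (78 * N)) ** (2 / n)
    term2 = 1 - (2 ** (1 / (n - 1)) - 1) ** (2 / n)
    # Each cluster observes at least one similarity to the other clusters.
    term3 = 1 - (alpha / (2 * (N - n))) ** (1 / n)
    return max(term1, term2, term3)

p_needed = required_sampling_probability(1000, 75, 0.05)
```

For N = 1000, n = 75 and α = 0.05 the first term dominates and `p_needed` is a little above 0.23. The closed-form rate of Theorem 3.2 is a simpler, more conservative sufficient condition, so it is expected to be larger than this value.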






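Proposition 2. Given a set of $N \geq 4$ items, the agglomerative clustering algorithm will recover all
$\leq \frac{N}{n}$ clusters of size $\geq n \geq 4$ of $\mathcal{T}$ with probability $\geq (1 - \frac{\alpha}{2})$ given that the sampling probability satisfies,

$$ p \geq \max\left\{ 1 - \left(\frac{\alpha n}{78 N}\right)^{2/n},\; 1 - \left(2^{\frac{1}{n-1}} - 1\right)^{2/n} \right\} \tag{3} $$

Proof. We prove this proposition using prior work on random graph theory in the Appendix (Section 5.2). □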

While Proposition 2 ensures that there are enough pairwise similarities to determine each leaf cluster
(down to size $n$), to resolve the entire tree structure down to clusters of size $\geq n$ we additionally require
enough similarities to determine the connectivity between these clusters.

Proposition 3. Consider the quadruple $(X, \mathcal{T}, S, \Omega(p))$, where the Tight Clustering (TC) condition is
satisfied, and $\mathcal{T}$ is a complete (possibly unbalanced) binary tree that is unknown. Then, given a set of clusters
of size $\geq n$, the clustering structure of $\mathcal{T}$ pruned to cluster size $n$ will be resolved with probability $\geq (1 - \frac{\alpha}{2})$
given that the sampling $\Omega(p)$ satisfies,

$$ p \geq 1 - \left(\frac{\alpha}{2(N-n)}\right)^{1/n} \tag{4} $$

Proof. Consider the clustering structure of $\mathcal{T}$ and a single cluster of size $n$. At most, there will be $N - n$
other clusters in $\mathcal{T}$ that must be compared against to construct the clustering hierarchy. Given sampling
rate $p$, at least one item (out of $\geq n$) must observe at least one pairwise similarity with at least one
item in the other cluster. Therefore, to ensure that every cluster satisfies this with probability $\geq (1 - \frac{\alpha}{2})$,
using the union bound we require the sampling rate to satisfy,

$$ (1 - p)^n \leq \frac{\alpha}{2(N-n)} $$

$$ p \geq 1 - \left(\frac{\alpha}{2(N-n)}\right)^{1/n} $$

□

Combining the results of Propositions 2 and 3, we find the sampling probability rate necessary to ensure
with high probability (i.e., $\geq 1 - \alpha$) that all the clusters of size $\geq n$ will be resolved, and the clustering
hierarchy between these clusters can be reconstructed. This is shown in Equation 2.



3.1 Sampling Rate Required for Given Cluster Sizes 

Using the results from Theorem 3.1, we can state the expected number of pairwise similarity measurements
needed to observe clusters down to a specified level. For clusters of size $\sim O(\delta N^{\beta})$, where $\beta \in (0, 1]$ and
$\delta \in (0, 1]$, we find the following,

Theorem 3.2. Consider the quadruple $(X, \mathcal{T}, S, \Omega(p))$, where the Tight Clustering (TC) condition is
satisfied, and $\mathcal{T}$ is a complete (possibly unbalanced) binary tree that is unknown. Then, agglomerative clustering
recovers all clusters of size $\geq \delta N^{\beta} \geq 4$ (where $\beta \in (0, 1]$ and $\delta \in (0, 1]$) of $\mathcal{T}$ with probability $\geq (1 - \alpha)$
given that the total number of items $N \geq \max\left\{4, \left(\frac{78}{\alpha\delta}\right)^{\frac{1}{\kappa+\beta-1}}, \left(\frac{2}{\alpha}\right)^{\frac{1}{2\kappa-1}}\right\}$ and the pairwise similarities are
sampled at random with probability,

$$ p \geq \frac{2\kappa}{\delta} N^{-\beta} \log N \tag{5} $$

for a constant oversampling factor $\kappa \geq 3$.

Proof. The proof of this theorem can be found in the Appendix (Section 5.3). □
To see the improvements of these bounds, consider the following simple example. We want to reconstruct
the hierarchical clustering containing $N = 1000$ items, specifically recovering all clusters of size $\geq 75$ (using
$\delta = 0.5$, $\beta = 0.66$, and $\delta N^{\beta} = 75$) with probability $\geq (1 - \alpha) = 0.95$. Given oversampling factor $\kappa = 3$ and
using the results of Theorem 3.2, we find that resolving the clustering at this resolution only requires a similarity
sampling rate of $p \geq 0.5526$, on average observing 276,020 pairwise similarities. This is a significant savings
over standard techniques that require the entire set of 499,500 similarities.
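The arithmetic of this example can be reproduced with a short sketch. It assumes the rate of Equation 5 in the form $p = \frac{2\kappa}{\delta} N^{-\beta} \log N$ with a natural logarithm, written in terms of the smallest cluster size $\delta N^{\beta}$ (75 here); the function name is illustrative.

```python
import math

def theorem32_rate(n_items, min_cluster_size, kappa=3.0):
    """p = (2*kappa/delta) * N**(-beta) * log(N), expressed through the
    smallest cluster size delta * N**beta."""
    return 2.0 * kappa * math.log(n_items) / min_cluster_size

N = 1000
p = theorem32_rate(N, 75)         # about 0.5526
total_pairs = N * (N - 1) // 2    # 499500 pairs in the complete set
expected = p * total_pairs        # roughly 276,000 observed on average
```

The small difference between `expected` and the figure quoted in the text comes from rounding p before multiplying.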

And finally we consider the ability to find large clusters of size O (N). 

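Theorem 3.3. Consider the quadruple $(X, \mathcal{T}, S, \Omega(p))$, where the Tight Clustering (TC) condition is
satisfied, and $\mathcal{T}$ is a complete (possibly unbalanced) binary tree that is unknown. Then, agglomerative clustering
recovers all clusters of size $\geq \delta N \geq 4$ (where $\delta \in (0, 1]$) of $\mathcal{T}$ with probability $\geq (1 - \alpha)$ given that the
number of items $N \geq \max\left\{4, \left(\frac{78}{\alpha\delta}\right)^{\frac{1}{\kappa}}, \left(\frac{2}{\alpha}\right)^{\frac{1}{2\kappa-1}}\right\}$ and the pairwise similarities are sampled at random with
probability,

$$ p \geq \frac{2\kappa}{\delta} N^{-1} \log N \tag{6} $$

for a constant oversampling factor $\kappa \geq 3$.

Proof. The proof follows from Theorem 3.2 with $\beta = 1$. □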

Therefore, we find that resolving clusters down to size $\geq \delta N \geq 4$ (for $\delta \in (0, 1]$) requires only $\frac{\kappa}{\delta} N \log N$
randomly chosen pairwise similarities in expectation. For example, to resolve the hierarchical clustering
down to clusters of size $\geq 100$ from a set of 1000 items requires (on average) less than 42% of the complete
set of pairwise similarities to be observed at random (given $\alpha = 0.05$).
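To illustrate how this rate scales, the sketch below evaluates the sampling probability $p = \frac{2\kappa}{\delta} N^{-1} \log N$ and the resulting expected number of observed pairs for a fixed fraction $\delta$. The function name and the clipping of p at 1 are assumptions made for the example.

```python
import math

def expected_similarities(n_items, delta, kappa=3.0):
    """Sampling probability for clusters of size >= delta*N, and the
    expected number of observed pairs p * N(N-1)/2."""
    p = min(1.0, 2.0 * kappa * math.log(n_items) / (delta * n_items))
    return p, p * n_items * (n_items - 1) / 2.0

p_1k, count_1k = expected_similarities(1000, 0.1)   # clusters of size >= 100
# Fraction of all pairs needed as N grows with delta fixed at 0.1.
fractions = [expected_similarities(n, 0.1)[0] for n in (10**3, 10**4, 10**5)]
```

For N = 1000 the fraction is just under 42%, matching the example above, and it shrinks as N grows because the required number of similarities grows like N log N rather than N^2.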

4 Conclusions 

Hierarchical clustering from pairwise similarities is found in disparate problems ranging from Internet
topology discovery to bioinformatics. Often these applications are limited by a significant cost required to obtain
each pairwise similarity. Prior work on efficient clustering required an adaptive regime where targeted
measurements were acquired one at a time. In this paper, we consider the more general problem of clustering
from a set of incomplete similarities taken at random. We present provable bounds demonstrating that
large clusters can be resolved with only $O(N \log N)$ similarities on average. Future work in
this area includes developing efficient clustering techniques that are also robust to outlier measurements and 
considering alternative sampling methodologies. 

5 Appendix 
5.1 Lemma 1

Lemma 1. Given $n \geq 4$ and $0 < q < 1$,

$$ 2q^{n/2}\left(1 + q^{\frac{n-2}{2}}\right)^{n-1} \leq 20\,q^{n/2}\left(1 + q^{n/2}\right)^{n-1} $$



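Proof. Consider some constant value $\lambda > 1$, then we want to show that,

$$ 2q^{n/2}\left(1 + q^{\frac{n-2}{2}}\right)^{n-1} \leq 2\lambda\,q^{n/2}\left(1 + q^{n/2}\right)^{n-1} $$

Rearranging both sides and given that $\lambda^{\frac{1}{n-1}} \in (1, \lambda)$ for $n > 1$,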

l + q"^) < A^+g"/2 
Further rearranging and bounding the log function,

log (l + - < log (a^) 

< ^log(A) 

n — 1 

q-"(--l] < ^log(A) 
\q J n-1 

If q > i, then this inequality is satisfied as the left-hand term is always negative and the right-hand term is 
always positive. 

Considering q < ^, 

q-/^(l-l]<q^ < ^log(A) 



q J n — 1 



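Taking the log of both sides,

$$ \frac{n-2}{2} \log q \leq \log\log(\lambda) - \log(n-1) $$

If $n \geq 4$, then $\left(\frac{n}{2} - 1\right) \geq \frac{n}{4}$. Therefore,

$$ \frac{n}{4} \log q \leq \log\log(\lambda) - \log(n) $$

If $q \leq \frac{1}{2}$, then $\log q \leq \log\frac{1}{2}$, therefore,

$$ \frac{n}{4} \log\frac{1}{2} - \log\log(\lambda) \leq -\log(n) $$

$$ \frac{n}{4} \log 2 + \log\log(\lambda) \geq \log(n) $$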

Solving numerically, we find that this is satisfied for all $n \geq 4$ if $\lambda = 10$. This proves the result.
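The lemma can also be spot-checked numerically. The sketch below assumes the statement reads $2q^{n/2}(1 + q^{(n-2)/2})^{n-1} \leq 20\,q^{n/2}(1 + q^{n/2})^{n-1}$ and evaluates both sides on a grid of n and q; a grid check is only a sanity test and does not replace the proof.

```python
def lemma1_holds(n, q):
    """Compare the two sides of Lemma 1 for one pair (n, q)."""
    lhs = 2 * q ** (n / 2) * (1 + q ** ((n - 2) / 2)) ** (n - 1)
    rhs = 20 * q ** (n / 2) * (1 + q ** (n / 2)) ** (n - 1)
    return lhs <= rhs

# n from 4 to 200, q from 0.01 to 0.99 in steps of 0.01.
checked = all(lemma1_holds(n, k / 100) for n in range(4, 201) for k in range(1, 100))
```

On this grid the inequality holds with room to spare, which suggests the constant 20 is conservative.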

□ 

5.2 Proof of Proposition 2

We begin by determining the sampling probability, $p$, necessary to ensure that a single set of $n$ items can be
clustered. This is equivalent to a Bernoulli random graph ($G_{n,p}$) of size $n$ and probability $p$ being connected.
From [16], we bound this probability as,

$$ P(G_{n,p} \text{ is connected}) \geq 1 - q^{n-1}\left(\left(1 + q^{\frac{n-2}{2}}\right)^{n-1} - q^{\frac{(n-2)(n-1)}{2}}\right) - q^{n/2}\left(\left(1 + q^{\frac{n-2}{2}}\right)^{n-1} - 1\right) $$

where $q = 1 - p$. Simplifying in terms of $n$, and using $q^{n-1} \leq q^{n/2}$ (for all $n \geq 2$),

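$$ P(G_{n,p} \text{ is connected}) \geq 1 - \left(1 + q^{\frac{n-2}{2}}\right)^{n-1}\left(q^{n/2} + q^{n-1}\right) + q^{n/2} $$

$$ \geq 1 - 2q^{n/2}\left(1 + q^{\frac{n-2}{2}}\right)^{n-1} + q^{n/2} $$

Bounding the middle term using Lemma 1 (found in the appendix, Section 5.1), we then find that for
all $n \geq 4$,

$$ P(G_{n,p} \text{ is connected}) \geq 1 - 20\,q^{n/2}\left(1 + q^{n/2}\right)^{n-1} + q^{n/2} $$

$$ \geq 1 - q^{n/2}\left(20\left(1 + q^{n/2}\right)^{n-1} - 1\right) $$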
Considering the entire set of $N$ items, there will be at most $\frac{N}{n}$ leaf clusters to resolve (where each cluster
has $\geq n$ items). Therefore, we bound the probability that $G_{n,p}$ is disconnected as,

$$ q^{n/2}\left(20\left(1 + q^{n/2}\right)^{n-1} - 1\right) \leq \frac{\alpha n}{2N} $$

where, using the union bound, we state that all $\leq \frac{N}{n}$ clusters containing $\geq n$ items will be connected with
probability $\geq 1 - \frac{\alpha}{2}$.

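Bounding $q^{n/2} \leq C$, with $C > 0$ being a dummy variable, then

$$ q^{n/2}\left(20\left(1 + q^{n/2}\right)^{n-1} - 1\right) \leq C\left(20\left(1 + q^{n/2}\right)^{n-1} - 1\right) \leq \frac{\alpha n}{2N} $$

Therefore, we can solve with respect to the probability of observation, $p$, and dummy variable $C$ that,

$$ q^{n/2} \leq C $$

$$ p \geq 1 - C^{2/n} \tag{7} $$

And,

$$ C\left(20\left(1 + q^{n/2}\right)^{n-1} - 1\right) \leq \frac{\alpha n}{2N} $$

Then rearranging these terms with respect to $q$,

$$ q \leq \left(\left(\frac{\alpha n}{40NC} + \frac{1}{20}\right)^{\frac{1}{n-1}} - 1\right)^{2/n} $$

In terms of the probability of pairwise similarity observation ($p$, where $p = 1 - q$),

$$ p \geq 1 - \left(\left(\frac{\alpha n}{40NC} + \frac{1}{20}\right)^{\frac{1}{n-1}} - 1\right)^{2/n} \tag{8} $$

Combining the bounds in Equations 7 and 8, we find that for a given choice of $C > 0$,

$$ p \geq \max\left\{ 1 - C^{2/n},\; 1 - \left(\left(\frac{\alpha n}{40NC} + \frac{1}{20}\right)^{\frac{1}{n-1}} - 1\right)^{2/n} \right\} \tag{9} $$

We note that for any choice of $\alpha, n, N$, the first term is monotonically decreasing in $C$, while the second term is
monotonically increasing in $C$. To simplify the analysis, we take,

$$ C = \frac{\alpha n}{78N} \tag{10} $$

Plugging this value of $C$ into Equation 9 gives us the result.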
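For readers who want the substitution spelled out, the following worked step shows why the choice $C = \frac{\alpha n}{78N}$ is convenient, assuming the second term has the form $\left(\frac{\alpha n}{40NC} + \frac{1}{20}\right)^{\frac{1}{n-1}}$ obtained by rearranging the union-bound inequality above. The inner sum collapses to exactly 2, which removes the dependence on $\alpha$ and $N$ from the second term.

```latex
\frac{\alpha n}{40 N C} + \frac{1}{20}
  \;=\; \frac{\alpha n}{40 N}\cdot\frac{78 N}{\alpha n} + \frac{1}{20}
  \;=\; \frac{78}{40} + \frac{2}{40} \;=\; 2 ,
\qquad\text{so}\qquad
p \;\ge\; \max\left\{ 1 - \left(\frac{\alpha n}{78 N}\right)^{2/n},\;
                       1 - \left(2^{\frac{1}{n-1}} - 1\right)^{2/n} \right\}.
```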
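5.3 Proof of Theorem 3.2

Using Theorem 3.1, we find the required probability of similarity observation for a given $n, N$. Given that
$n \geq \delta N^{\beta} \geq 4$, we can state,

$$ p \geq \max\left\{ 1 - \left(\frac{\alpha\delta N^{\beta}}{78N}\right)^{\frac{2}{\delta N^{\beta}}},\; 1 - \left(2^{\frac{1}{\delta N^{\beta}-1}} - 1\right)^{\frac{2}{\delta N^{\beta}}},\; 1 - \left(\frac{\alpha}{2(N - \delta N^{\beta})}\right)^{\frac{1}{\delta N^{\beta}}} \right\} \tag{11} $$

We now show that the specified sampling probability rate, $\frac{2\kappa}{\delta} N^{-\beta} \log N$, is greater than all terms inside
this bound.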
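5.3.1 First Term Derivation

The first term of Equation 11 is satisfied if,

$$ \frac{2\kappa}{\delta} N^{-\beta} \log N \geq 1 - \left(\frac{\alpha\delta N^{\beta}}{78N}\right)^{\frac{2}{\delta N^{\beta}}} $$

$$ \frac{\delta N^{\beta}}{2} \log\left(1 - \frac{2\kappa}{\delta} N^{-\beta} \log N\right) \leq \log\left(\frac{\alpha\delta N^{\beta-1}}{78}\right) $$

Bounding the logarithm function, we can state that $\log\left(1 - \frac{2\kappa}{\delta} N^{-\beta} \log N\right) \leq -\frac{2\kappa}{\delta} N^{-\beta} \log N$. Therefore,

$$ \frac{\delta N^{\beta}}{2}\left(-\frac{2\kappa}{\delta} N^{-\beta} \log N\right) = -\kappa \log N \leq \log\left(\frac{\alpha\delta N^{\beta-1}}{78}\right) $$

Then,

$$ (1 - \beta - \kappa) \log N \leq \log\left(\frac{\alpha\delta}{78}\right) $$

Therefore, this bound holds if $N \geq \left(\frac{78}{\alpha\delta}\right)^{\frac{1}{\kappa+\beta-1}}$.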



5.3.2 Second Term Derivation 

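The second term,

$$ \frac{2\kappa}{\delta} N^{-\beta} \log N \geq 1 - \left(2^{\frac{1}{\delta N^{\beta}-1}} - 1\right)^{\frac{2}{\delta N^{\beta}}} $$

$$ \frac{\delta N^{\beta}}{2} \log\left(1 - \frac{2\kappa}{\delta} N^{-\beta} \log N\right) \leq \log\left(2^{\frac{1}{\delta N^{\beta}-1}} - 1\right) $$

Again, bounding the log function,

$$ \frac{\delta N^{\beta}}{2}\left(-\frac{2\kappa}{\delta} N^{-\beta} \log N\right) \leq \log\left(2^{\frac{1}{\delta N^{\beta}-1}} - 1\right) $$

$$ -\kappa \log N \leq \log\left(2^{\frac{1}{\delta N^{\beta}-1}} - 1\right) $$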
We can then lower bound the term log ^2«"''-i — 1^ . Given that 2««'^-i - 1 e (-i-, l) for any (5A^^ > 1, 



lot 



^ (^2^^ ~ 2) 



N log (N) 



N - 
Then, 



-K log TV < 



7Vlog(iV) 



iV- 

And rearranging this term, 

2N N 
~K + r < 



111 (2^^^ - 2) 



N -1 - N ■ 

Given that the right hand side is always greater than zero, we can then find that the second term holds if,

$$ \kappa \geq \frac{2N}{N-1} $$

Therefore, the second term holds if $\kappa \geq 3$ and $N \geq 4$.
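5.3.3 Third Term Derivation

And finally, the third term can be bounded as,

$$ \frac{2\kappa}{\delta} N^{-\beta} \log N \geq 1 - \left(\frac{\alpha}{2(N - \delta N^{\beta})}\right)^{\frac{1}{\delta N^{\beta}}} $$

$$ \left(\delta N^{\beta}\right) \log\left(1 - \frac{2\kappa}{\delta} N^{-\beta} \log N\right) \leq \log\left(\frac{\alpha}{2(N - \delta N^{\beta})}\right) $$

Bounding the log function,

$$ \left(-\delta N^{\beta}\right) \frac{2\kappa}{\delta} N^{-\beta} \log N = -2\kappa \log N \leq \log\left(\frac{\alpha}{2(N - \delta N^{\beta})}\right) $$

Then,

$$ N^{-2\kappa}\left(N - \delta N^{\beta}\right) = N^{1-2\kappa}\left(1 - \delta N^{\beta-1}\right) \leq N^{1-2\kappa} \leq \frac{\alpha}{2} $$

Then we find that this term is satisfied if $N \geq \left(\frac{2}{\alpha}\right)^{\frac{1}{2\kappa-1}}$.

By combining all three bounds, we find that sampling with probability $\frac{2\kappa}{\delta} N^{-\beta} \log N$ will resolve a pruning
of the hierarchical clustering down to clusters of size $\geq \delta N^{\beta} \geq 4$, if $\kappa \geq 3$ and $N \geq \max\left\{4, \left(\frac{78}{\alpha\delta}\right)^{\frac{1}{\kappa+\beta-1}}, \left(\frac{2}{\alpha}\right)^{\frac{1}{2\kappa-1}}\right\}$.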
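The condition on the number of items can be evaluated numerically. The sketch below assumes the two thresholds take the forms obtained in the first and third term derivations, $N \geq (78/(\alpha\delta))^{1/(\kappa+\beta-1)}$ and $N \geq (2/\alpha)^{1/(2\kappa-1)}$, together with $N \geq 4$ from the second term; the function name is hypothetical.

```python
def min_items_theorem32(alpha, delta, beta, kappa=3.0):
    """Smallest N allowed by the three conditions (before rounding up)."""
    first = (78.0 / (alpha * delta)) ** (1.0 / (kappa + beta - 1.0))
    third = (2.0 / alpha) ** (1.0 / (2.0 * kappa - 1.0))
    return max(4.0, first, third)

# Clusters of size >= 0.1 * N with failure probability alpha = 0.05.
n_min = min_items_theorem32(alpha=0.05, delta=0.1, beta=1.0)
```

For these values the first threshold dominates and `n_min` is just under 25, so the condition is mild compared with the problem sizes used in the examples.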